The goal of this project is to investigate how partnerships among multiple top-tier players in the NBA affect various performance measures and team outcomes. Among the research questions we would like to explore are the following:
[ INSERT RESEARCH QUESTIONS HERE ]
To investigate these questions, we need to pull data from multiple NBA seasons. The script below defines functions that pull traditional stats for every player in a user-specified season.
Load necessary libraries
library(rvest)
library(dplyr)
library(tidyverse)
library(httr)
Function to get NBA roster for a specified year
get_nba_roster <- function(year) {
# Construct the URL for the specified year
url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_per_game.html")
# Read the HTML content from the URL
webpage <- read_html(url)
# Extract the table containing the player statistics
roster_table <- webpage %>%
html_node("table#per_game_stats") %>%
html_table(fill = TRUE)
# Clean the data (remove header rows that might be duplicated)
roster_table <- roster_table %>%
filter(Player != "Player")
return(roster_table)
}
Example usage
year <- 2018 # Specify the year
nba_roster <- get_nba_roster(year)
#Print the first few rows of the roster
head(nba_roster)
tail(nba_roster)
# Keep every player who is not listed as a point guard (PG)
position_roster <- filter(nba_roster, Pos != "PG")
position_roster
library(plotly)
# Assuming 'nba_roster' is your data frame
input <- nba_roster[, c('MP', 'PTS', 'Player','Pos')]
# Create the plotly scatter plot
fig <- plot_ly(input, x = ~MP, y = ~PTS, type = 'scatter', mode = 'markers',
text = ~Player, # This adds player names on hover
hoverinfo = 'text', # Ensures that only player names appear on hover
color = ~Pos, # Colors points based on position
marker = list(size = 10))
# Set the plot title and axis labels
fig <- fig %>% layout(title = "Minutes Played vs Points Scored",
xaxis = list(title = "Minutes Played", range = c(0, 48)),
yaxis = list(title = "Points", range = c(0, 35)))
# Show the plot
fig
Warning: Ignoring 1 observations
Warning: Ignoring 1 observations
# Data visualization for Minutes Played vs Points Scored
# Get the input values and make sure they are numeric.
input <- nba_roster[, c('MP', 'PTS')]
input$MP <- as.numeric(input$MP)
input$PTS <- as.numeric(input$PTS)
print(head(input))
# Plot the chart for players with
# minutes played between 0 and 48 and
# points between 0 and 35.
plot(x = input$MP, y = input$PTS,
     xlab = "Minutes Played",
     ylab = "Points",
     xlim = c(0.0, 48),
     ylim = c(0.0, 35),
     main = "Minutes Played vs Points Scored"
)
# Data visualization for Field Goals Attempted vs Field Goals Made
# Get the input values and make sure they are numeric.
input_2 <- nba_roster[, c('FGA', 'FG')]
input_2$FGA <- as.numeric(input_2$FGA)
input_2$FG <- as.numeric(input_2$FG)
print(head(input_2))
# Use the largest observed values as the upper axis limits
b_FG <- max(input_2$FG, na.rm = TRUE)
b_FGA <- max(input_2$FGA, na.rm = TRUE)
# Plot field goals made against field goals attempted
plot(x = input_2$FGA, y = input_2$FG,
     xlab = "Field Goal Attempts",
     ylab = "Field Goals Made",
     xlim = c(0.0, b_FGA),
     ylim = c(0.0, b_FG),
     main = "Field Goal Attempts vs Field Goals Made"
)
# Plot the mean points per game for each position
# (PTS is mapped to y and averaged, since a numeric fill is dropped by geom_bar's count)
ggplot(nba_roster, aes(x = Pos, y = as.numeric(PTS))) +
  geom_bar(stat = "summary", fun = "mean") +
  theme_classic(16) +
  xlab("Position") +
  ylab("Mean points per game")
ASSIGNMENT 1: Is the data “clean”? Are there any missing values to be accounted for/addressed? If there are any data quality issues,
a. propose a method to resolve them
The stat columns are read in as character even though they hold fractional values, so they should be converted from “character” to “double”.
b. justify the validity of your approach
Observations with missing data are removed with the function “na.omit”, which drops every row that contains a missing value.
c. implement your proposed changes
For players who are missing data
nba_roster <- na.omit(nba_roster)
nba_roster
# Convert the stat columns (G through PTS) from character to double
# and save the result back to nba_roster
nba_roster <- nba_roster %>%
  mutate(across(G:PTS, as.numeric))
To determine whether a player is “top tier” and should be considered a part of a “Big 3” lineup, other authors have transformed traditional stats to create metrics such as
PRA = POINTS + REBOUNDS + ASSISTS
We will consider advanced statistics such as PLAYER EFFICIENCY RATING:
PER = (PTS + REB + AST + STL + BLK − Missed FG − Missed FT − TO) / GP
In particular, Value over Replacement (VORP) seems to do a solid job of identifying the best players in the league.
The script below provides code to create functions that pull advanced stats for every player for a given user-defined season.
# Function to get NBA advanced stats for a specified year
get_nba_advanced_stats <- function(year) {
# Construct the URL for the specified year
url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
# Read the HTML content from the URL
webpage <- read_html(url)
# Extract the table containing the advanced player statistics
advanced_stats_table <- webpage %>%
html_node("table#advanced_stats") %>%
html_table(fill = TRUE)
# Clean the data (remove header rows that might be duplicated)
# advanced_stats_table <- advanced_stats_table %>%
# filter(Player != "Player")
return(advanced_stats_table)
}
# Example usage
year <- 2018 # Specify the year
nba_advanced_stats <- get_nba_advanced_stats(year)
# Print the first few rows of the advanced stats
head(nba_advanced_stats)
ASSIGNMENT 2: Is the advanced data “clean”? Are there any missing values to be accounted for/addressed? If there are any data quality issues,
a. propose a method to resolve them
b. justify the validity of your approach
c. implement your proposed changes
The cleaning is similar to the one applied to the traditional stats.
The script below provides code to clean out the quality issues present in the data frame
# Order the players alphabetically so that the filler header rows (where Player is "Player") group together and are easy to clean out
AO_nba_advanced_stats <- nba_advanced_stats[order(nba_advanced_stats$Player),]
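# remove filler rows that had been previously used as headers on webpage
# (matched by value rather than by row position, so it works for any season)
AO_nba_advanced_stats <- AO_nba_advanced_stats[AO_nba_advanced_stats$Player != "Player", ]
# preview the data without the columns that are entirely NA (the result is not saved)
AO_nba_advanced_stats %>%
  select(where(~!all(is.na(.))))
# removing columns 20 and 25 from the dataframe since they're blank
# (once column 20 is dropped, the original column 25 sits at position 24)
AO_nba_advanced_stats <- AO_nba_advanced_stats[, -20]
AO_nba_advanced_stats <- AO_nba_advanced_stats[, -24]
# change the range of columns from <chr> to <dbl> and save the result
AO_nba_advanced_stats <- AO_nba_advanced_stats %>%
  mutate(across(G:VORP, as.numeric))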
Warning: There were 22 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(G:VORP, as.numeric)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run dplyr::last_dplyr_warnings() to see the 21 remaining warnings.
ASSIGNMENT 3: Merge the cleaned up datasets to create one new data frame with the traditional and advanced stats.
# how to merge two files into one new data frame
nba_merge <- merge(nba_roster, AO_nba_advanced_stats, by = c("Rk", "Player", "Pos", "Age", "Tm", "G"))
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
ASSIGNMENT 4: Make a function with argument year that outputs one dataframe with the merged traditional and advanced data.
# Get NBA totals statistics
get_nba_totals_stats <- function(year) {
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_totals.html")
  webpage <- read_html(url)
  totals_stats_table <- webpage %>%
    html_node("table#totals_stats") %>%
    html_table(fill = TRUE)
  # Clean up column names
  colnames(totals_stats_table) <- make.names(colnames(totals_stats_table), unique = TRUE)
  # Clean the data
  totals_stats_table <- totals_stats_table %>%
    filter(!is.na(Player) & Player != "Player") # Ensure no NA or duplicate header rows
  return(totals_stats_table)
}
get_nba_advanced_stats <- function(year) {
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
  # Fetch webpage (the httr:: prefixes ensure GET() and user_agent() are found
  # even when library(httr) has not been run)
  webpage <- tryCatch({
    read_html(httr::GET(url, httr::user_agent("Mozilla/5.0")))
  }, error = function(e) {
    stop("Error fetching webpage: ", e$message)
  })
  # Extract table with updated ID
  advanced_stats_table <- webpage %>%
    html_node("table#advanced") %>% # Updated selector to match the new ID
    html_table(fill = TRUE)
  # Clean up column names
  colnames(advanced_stats_table) <- make.names(colnames(advanced_stats_table), unique = TRUE)
  # Clean the data
  advanced_stats_table <- advanced_stats_table %>%
    filter(!is.na(Player) & Player != "Player") # Remove NA rows and duplicate headers
  return(advanced_stats_table)
}
get_cleaned_nba_stats <- function(year) {
  # Fetch totals and advanced stats
  nba_totals <- get_nba_totals_stats(year)
  nba_advanced <- get_nba_advanced_stats(year)
  # Print to check datasets (Optional)
  print("NBA Totals:")
  print(head(nba_totals))
  print("NBA Advanced:")
  print(head(nba_advanced))
  # Clean Player names in the advanced dataset: remove the asterisk and trim spaces
  nba_advanced <- nba_advanced %>%
    mutate(Player = trimws(gsub("\\*", "", Player)))
  # Merge the datasets on Player, Pos, G, and MP
  nba_merge <- merge(nba_totals, nba_advanced,
                     by = c("Player", "Pos", "G", "MP"),
                     all.x = TRUE)
  # Debug: Show merged data sample
  print("Merged Data Sample:")
  print(head(nba_merge))
  # Check column names to confirm 'Team' exists
  print("Column Names Before Renaming:")
  print(colnames(nba_merge))
  # Clean 'Team' column: rename 'Tm' to 'Team' if present
  if ("Tm" %in% colnames(nba_merge)) {
    nba_merge <- nba_merge %>%
      rename(Team = Tm)
  }
  # Debug: Show column names after renaming
  print("Column Names After Renaming 'Tm' to 'Team':")
  print(colnames(nba_merge))
  # Handle duplicate columns like Team.x, Team.y, Awards.x, Awards.y
  duplicate_columns <- colnames(nba_merge)[grepl("\\.x$", colnames(nba_merge))]
  for (col in duplicate_columns) {
    # Extract the base name of the column (e.g., "Team" from "Team.x")
    base_col <- sub("\\.x$", "", col)
    y_col <- paste0(base_col, ".y")
    # Merge the .x and .y columns into one
    if (y_col %in% colnames(nba_merge)) {
      nba_merge <- nba_merge %>%
        mutate(!!base_col := coalesce(.data[[col]], .data[[y_col]])) %>%
        select(-all_of(c(col, y_col))) # Drop the old columns
    }
  }
  # Remove the 'X', 'X.1' columns if they exist
  columns_to_remove <- c("X", "X.1")
  nba_merge <- nba_merge %>%
    select(-any_of(columns_to_remove))
  # Merge 'Rk.x' and 'Rk.y' columns
  if ("Rk.x" %in% names(nba_merge) && "Rk.y" %in% names(nba_merge)) {
    nba_merge <- nba_merge %>%
      mutate(Rk = coalesce(as.character(Rk.x), as.character(Rk.y))) %>%
      select(-Rk.x, -Rk.y)
  }
  # Merge 'Age.x' and 'Age.y' columns
  if ("Age.x" %in% names(nba_merge) && "Age.y" %in% names(nba_merge)) {
    nba_merge <- nba_merge %>%
      mutate(Age = coalesce(as.numeric(Age.x), as.numeric(Age.y))) %>%
      select(-c(Age.x, Age.y))
  }
  # Reorder columns for clarity
  column_order <- c("Player", "Pos", "Age", "Rk", "G", "MP", "Team")
  nba_merge <- nba_merge %>%
    select(all_of(column_order), everything())
  # Return the cleaned and merged dataset
  return(nba_merge)
}
# Example usage
nba_data_2013 <- get_cleaned_nba_stats(2013)
nba_data_1984 <- get_cleaned_nba_stats(1984)
ASSIGNMENT 5: Make this file more visually appealing, with headers, bullet points, sections and subsections as you see fit. You may consider migrating over to Quarto for this reason.
<<<<<<< Updated upstream
---
title: "R Notebook"
output: html_notebook
---

The goal of this project is to investigate how partnerships among multiple top-tier players in the NBA affect various performance measures and team outcomes. Among the research questions we would like to explore are the following:

*[ INSERT RESEARCH QUESTIONS HERE ]*

To investigate these questions, we need to pull data from multiple NBA seasons. The script below defines functions that pull traditional stats for every player in a user-specified season.

Load necessary libraries
```{r}
library(rvest)
library(dplyr)
library(tidyverse)
```

Function to get NBA roster for a specified year
```{r}
get_nba_roster <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_per_game.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)


  # Extract the table containing the player statistics
  roster_table <- webpage %>%
    html_node("table#per_game_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that might be duplicated)
  roster_table <- roster_table %>%
    filter(Player != "Player")
    return(roster_table)
}
```
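
The scraping pattern above can be tried offline. The sketch below is only an illustration: the HTML string is a made-up, two-player stand-in for the real page (including one repeated header row), and `toy_roster` is a hypothetical name. It shows why the `filter(Player != "Player")` step is needed and why a type conversion is still required afterwards.

```r
library(rvest)
library(dplyr)

# Made-up stand-in for the scraped page; the middle row mimics the header
# rows that the real table repeats between blocks of players.
toy_html <- '
<table id="per_game_stats">
  <tr><th>Player</th><th>Pos</th><th>PTS</th></tr>
  <tr><td>Player A</td><td>PG</td><td>25.1</td></tr>
  <tr><td>Player</td><td>Pos</td><td>PTS</td></tr>
  <tr><td>Player B</td><td>C</td><td>11.4</td></tr>
</table>'

toy_roster <- read_html(toy_html) %>%
  html_node("table#per_game_stats") %>%
  html_table() %>%
  filter(Player != "Player") %>%     # drop the repeated header row
  mutate(PTS = as.numeric(PTS))      # convert the stat column once headers are gone

toy_roster
```

Because the repeated header row puts the text "PTS" in the PTS column, that column cannot be read as numeric until the header rows are removed and the column is converted explicitly.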

Example usage
```{r}
year <- 2018  # Specify the year
nba_roster <- get_nba_roster(year)

#Print the first few rows of the roster
head(nba_roster)
tail(nba_roster)

```
```{r}
# Keep every player who is not listed as a point guard (PG)

position_roster<-filter(nba_roster,Pos!="PG" )
position_roster
```



```{r}

library(plotly)

# Assuming 'nba_roster' is your data frame
input <- nba_roster[, c('MP', 'PTS', 'Player','Pos')]

# Create the plotly scatter plot
fig <- plot_ly(input, x = ~MP, y = ~PTS, type = 'scatter', mode = 'markers',
               text = ~Player,  # This adds player names on hover
               hoverinfo = 'text', # Ensures that only player names appear on hover
               color = ~Pos,  # Colors points based on position
               marker = list(size = 10))

# Set the plot title and axis labels
fig <- fig %>% layout(title = "Minutes Played vs Points Scored",
                      xaxis = list(title = "Minutes Played", range = c(0, 48)),
                      yaxis = list(title = "Points", range = c(0, 35)))

# Show the plot
fig


```
```{r}
#Data Visualization for Minutes Played vs Points Scored
# Get the input values and make sure they are numeric.
input <- nba_roster[, c('MP', 'PTS')]
input$MP <- as.numeric(input$MP)
input$PTS <- as.numeric(input$PTS)
print(head(input))

# Plot the chart for players with
# minutes played between 0 and 48 and
# points between 0 and 35.
plot(x = input$MP, y = input$PTS,
     xlab = "Minutes Played",
     ylab = "Points",
     xlim = c(0.0, 48),
     ylim = c(0.0, 35),
     main = "Minutes Played vs Points Scored"
)

```


```{r}
#Data Visualization for Field Goals Attempted vs Field Goals Made
# Get the input values and make sure they are numeric.
input_2 <- nba_roster[, c('FGA', 'FG')]
input_2$FGA <- as.numeric(input_2$FGA)
input_2$FG <- as.numeric(input_2$FG)
print(head(input_2))

# Use the largest observed values as the upper axis limits
b_FG <- max(input_2$FG, na.rm = TRUE)
b_FGA <- max(input_2$FGA, na.rm = TRUE)

# Plot field goals made against field goals attempted
plot(x = input_2$FGA, y = input_2$FG,
     xlab = "Field Goal Attempts",
     ylab = "Field Goals Made",
     xlim = c(0.0, b_FGA),
     ylim = c(0.0, b_FG),
     main = "Field Goal Attempts vs Field Goals Made"
)


```



```{r}
# Plot the mean points per game for each position
# (PTS is mapped to y and averaged, since a numeric fill is dropped by geom_bar's count)
ggplot(nba_roster, aes(x = Pos, y = as.numeric(PTS))) +
  geom_bar(stat = "summary", fun = "mean") +
  theme_classic(16) +
  xlab("Position") +
  ylab("Mean points per game")

```


**ASSIGNMENT 1:** *Is the data "clean"? Are there any missing values to be accounted for/addressed? If there are any data quality issues,*

 - *a. propose a method to resolve them*
       
The stat columns are read in as character even though they hold fractional values, so they should be converted from "character" to "double".


 - *b. justify the validity of your approach*
Observations with missing data are removed with the function "na.omit", which drops every row that contains a missing value.


 - *c. implement your proposed changes*


For players who are missing data
```{r}
nba_roster <- na.omit(nba_roster)

nba_roster

# Convert the stat columns (G through PTS) from character to double
# and save the result back to nba_roster
nba_roster <- nba_roster %>%
   mutate(across(G:PTS, as.numeric))



```
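
Before dropping rows, it can help to count how many values are actually missing in each column. The sketch below uses a made-up three-player data frame (`toy` and its column names are hypothetical) and assumes blank cells arrive as empty strings. Under that assumption they only become `NA` after the conversion to numeric, so `na.omit` finds them only when it runs after the conversion.

```r
library(dplyr)

# Made-up data: Player B has no field goal attempts, so the percentage is blank.
toy <- data.frame(
  Player = c("Player A", "Player B", "Player C"),
  FG_pct = c("0.512", "", "0.431"),
  PTS    = c("25.1", "0.0", "11.4"),
  stringsAsFactors = FALSE
)

# An empty string is not NA, so nothing is flagged before conversion.
na_before <- colSums(is.na(toy))

# Convert the stat columns, then count missing values per column.
toy_num  <- toy %>% mutate(across(FG_pct:PTS, as.numeric))
na_after <- colSums(is.na(toy_num))

# Drop the incomplete rows only after the conversion.
toy_complete <- na.omit(toy_num)

na_before
na_after
toy_complete
```

Whether dropping such rows is appropriate depends on the analysis: a player with a blank shooting percentage may still have valid values in every other column.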

To determine whether a player is "top tier" and should be considered a part of a "Big 3" lineup, other authors have transformed traditional stats to create metrics such as

PRA = POINTS + REBOUNDS + ASSISTS 

We will consider advanced statistics such as PLAYER EFFICIENCY RATING:

PER = (PTS + REB + AST + STL + BLK − Missed FG − Missed FT − TO) / GP

In particular, Value over Replacement (VORP) seems to do a solid job of identifying the best players in the league.
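
As a rough illustration, the two formulas above can be computed directly from season totals. The sketch below is not the official PER calculation; it only applies the simplified expression written above to made-up numbers. The column names (`TRB`, `AST`, `TOV`, and so on) are assumptions about how the totals table is labelled, and `EFF` is a hypothetical name for the result.

```r
library(dplyr)

# Made-up season totals for two hypothetical players.
toy_totals <- data.frame(
  Player = c("Player A", "Player B"),
  G   = c(70, 60),
  PTS = c(1750, 600), TRB = c(490, 300), AST = c(420, 120),
  STL = c(70, 30),    BLK = c(35, 60),
  FGA = c(1300, 500), FG  = c(640, 240),
  FTA = c(400, 150),  FT  = c(340, 105),
  TOV = c(210, 90)
)

toy_metrics <- toy_totals %>%
  mutate(
    PRA = PTS + TRB + AST,
    # Missed shots are attempts minus makes.
    EFF = (PTS + TRB + AST + STL + BLK -
             (FGA - FG) - (FTA - FT) - TOV) / G
  ) %>%
  select(Player, PRA, EFF)

toy_metrics
```

A metric like this could then be ranked within a season to flag candidate "top tier" players, with the cutoff left as a modelling choice.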

The script below provides code to create functions that pull advanced stats for every player for a given user-defined season.
```{r}

# Function to get NBA advanced stats for a specified year
get_nba_advanced_stats <- function(year) {
  # Construct the URL for the specified year
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
  
  # Read the HTML content from the URL
  webpage <- read_html(url)
  
  # Extract the table containing the advanced player statistics
  advanced_stats_table <- webpage %>%
    html_node("table#advanced_stats") %>%
    html_table(fill = TRUE)
  
  # Clean the data (remove header rows that might be duplicated)
 # advanced_stats_table <- advanced_stats_table %>%
  #  filter(Player != "Player")
  
  return(advanced_stats_table)
}

# Example usage
year <- 2018  # Specify the year
nba_advanced_stats <- get_nba_advanced_stats(year)

# Print the first few rows of the advanced stats
head(nba_advanced_stats)


```

**ASSIGNMENT 2:** *Is the advanced data "clean"? Are there any missing values to be accounted for/addressed? If there are any data quality issues,*

 - *a. propose a method to resolve them*

 - *b. justify the validity of your approach*

 - *c. implement your proposed changes*
 
 The cleaning is similar to the one applied to the traditional stats.
 
 
 The script below provides code to clean out the quality issues present in the data frame.
 
 
 
```{r}
# Cleaning plan for the advanced stats (implemented in the next chunk):
#1 Order the athletes' names alphabetically so the filler header rows group together
#2 Remove the filler rows that had been used as headers on the webpage
#3 Remove the columns that contain only N/A values
```
 
 
 
```{r}

#want to order by alphabetic name to make cleaning out the filler headers from the dataset
AO_nba_advanced_stats <- nba_advanced_stats[order(nba_advanced_stats$Player),]



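# remove filler rows that had been previously used as headers on webpage
# (matched by value rather than by row position, so it works for any season)
AO_nba_advanced_stats <- AO_nba_advanced_stats[AO_nba_advanced_stats$Player != "Player", ]


# preview the data without the columns that are entirely NA (the result is not saved)
AO_nba_advanced_stats %>% 
  select(where(~!all(is.na(.))))
# removing columns 20 and 25 from the dataframe since they're blank
# (once column 20 is dropped, the original column 25 sits at position 24)
AO_nba_advanced_stats <- AO_nba_advanced_stats[, -20]
AO_nba_advanced_stats <- AO_nba_advanced_stats[, -24]


# change the range of columns from <chr> to <dbl> and save the result
AO_nba_advanced_stats <- AO_nba_advanced_stats %>%
   mutate(across(G:VORP, as.numeric))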




```
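
The cleaning steps above can be checked on a small example. The sketch below uses a made-up data frame (`toy_adv` is hypothetical) with one repeated header row. It removes that row by its value rather than by a row number, converts the stat columns, and saves the result, which is the part that is easy to forget with a pipe.

```r
library(dplyr)

# Made-up advanced-stats table with one repeated header row in the middle.
toy_adv <- data.frame(
  Player = c("Player B", "Player", "Player A"),
  G      = c("60", "G", "70"),
  VORP   = c("1.2", "VORP", "5.8"),
  stringsAsFactors = FALSE
)

toy_adv_clean <- toy_adv %>%
  filter(Player != "Player") %>%           # drop header rows by value
  mutate(across(G:VORP, as.numeric)) %>%   # character -> double
  arrange(Player)                          # alphabetical order

toy_adv_clean
```

Filtering by value means the same code should keep working when a different season has a different number of players or header rows.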


**ASSIGNMENT 3:** *Merge the cleaned up datasets to create one new data frame with the traditional and advanced stats.*



```{r}
#how to merge two files into one new data frame
nba_merge<-merge(nba_roster, AO_nba_advanced_stats, by.x = c("Rk", "Player", "Pos","Age", "Tm","G"), by.y = c("Rk", "Player", "Pos","Age", "Tm","G") , all.x = TRUE, all.y = TRUE)


head(nba_merge)
```
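
To see what `merge` does with the key columns, the sketch below joins two made-up tables (`toy_trad` and `toy_adv` are hypothetical names). Columns that appear in both tables but are not listed in `by` are kept twice with `.x` and `.y` suffixes, and `all = TRUE` keeps players that appear in only one table.

```r
# Made-up traditional stats: MP here is minutes per game.
toy_trad <- data.frame(
  Player = c("Player A", "Player B", "Player C"),
  Tm     = c("AAA", "BBB", "CCC"),
  MP     = c(34.1, 28.0, 12.5),
  PTS    = c(25.1, 11.4, 4.2)
)

# Made-up advanced stats: MP here is total minutes for the season.
toy_adv <- data.frame(
  Player = c("Player A", "Player B", "Player D"),
  Tm     = c("AAA", "BBB", "DDD"),
  MP     = c(2387, 1680, 300),
  VORP   = c(5.8, 1.2, -0.1)
)

# Full outer join on the shared identifier columns only.
toy_merge <- merge(toy_trad, toy_adv, by = c("Player", "Tm"), all = TRUE)

toy_merge
```

Every name listed in `by` has to exist in both data frames, so comparing `names()` of the two inputs is a quick first check when `merge` complains about its `by` argument.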


**ASSIGNMENT 4:** *Make a function with argument `year` that outputs one dataframe with the merged traditional and advanced data.* 

```{r}

combined_nba_stats <- function(year) {
  # TRADITIONAL STATS
  # get_nba_roster() (defined above) already removes the repeated header rows
  nba_roster2 <- get_nba_roster(year)

  #take out the N/A 
  nba_roster2 <- na.omit(nba_roster2)

  # Convert specific columns from character to double and save the result
  nba_roster2 <- nba_roster2 %>%
    mutate(across(G:PTS, as.numeric))

  #ADVANCED STATS
  # get_nba_advanced_stats() is defined above
  nba_advanced_stats2 <- get_nba_advanced_stats(year)

  #want to order by alphabetic name to make cleaning out the filler headers from the dataset
  AO_nba_advanced_stats2 <- nba_advanced_stats2[order(nba_advanced_stats2$Player), ]

  #removing column 20 and 25 from dataframe since they're blank
  # (once column 20 is dropped, the original column 25 sits at position 24)
  AO_nba_advanced_stats2 <- AO_nba_advanced_stats2[, -20]
  AO_nba_advanced_stats2 <- AO_nba_advanced_stats2[, -24]

  # remove filler rows that had been previously used as headers on webpage
  AO_nba_advanced_stats2 <- AO_nba_advanced_stats2[AO_nba_advanced_stats2$Player != "Player", ]
  AO_nba_advanced_stats2$Player <- factor(AO_nba_advanced_stats2$Player)

  #change range of columns from <chr> to <dbl> and save the result
  AO_nba_advanced_stats2 <- AO_nba_advanced_stats2 %>%
    mutate(across(G:VORP, as.numeric))

  # Merge the traditional and advanced stats, keeping unmatched rows from both
  nba_merge <- merge(nba_roster2, AO_nba_advanced_stats2,
                     by = c("Rk", "Player", "Pos", "Age", "Tm", "G"),
                     all.x = TRUE, all.y = TRUE)

  # Return the merged data frame for the requested year
  return(nba_merge)
}

# Example usage
nba_merge_2023 <- combined_nba_stats(2023)
head(nba_merge_2023)



```


**ASSIGNMENT 5:** *Make this file more visually appealing, with headers, bullet points, sections and subsections as you see fit. You may consider migrating over to Quarto for this reason.*



=======
---
title: "R Notebook"
output: html_notebook
---

The goal of this project is to investigate how partnerships among multiple top-tier players in the NBA affect various performance measures and team outcomes. Among the research questions we would like to explore are the following:

*[ INSERT RESEARCH QUESTIONS HERE ]*

To investigate these questions, we need to pull data from multiple NBA seasons. The script below defines functions that pull traditional stats for every player in a user-specified season.

Load necessary libraries
```{r}
library(rvest)
library(dplyr)
library(tidyverse)
library(httr)

```

Function to get NBA roster for a specified year


Example usage





**ASSIGNMENT 1:** *Is the data "clean"? Are there any missing values to be accounted for/addressed? If there are any data quality issues,*

 - *a. propose a method to resolve them*
       
The stat columns are read in as character even though they hold fractional values, so they should be converted from "character" to "double".


 - *b. justify the validity of your approach*
Observations with missing data are removed with the function "na.omit", which drops every row that contains a missing value.


 - *c. implement your proposed changes*


**ASSIGNMENT 2:** *Is the advanced data "clean"? Are there any missing values to be accounted for/addressed? If there are any data quality issues,*

 - *a. propose a method to resolve them*

 - *b. justify the validity of your approach*

 - *c. implement your proposed changes*
 
 The cleaning is similar to the one applied to the traditional stats.
 
 
 The script below provides code to clean out the quality issues present in the data frame.
 
 
 
```{r}

```
 
 
 
```{r}

```


**ASSIGNMENT 3:** *Merge the cleaned up datasets to create one new data frame with the traditional and advanced stats.*



```{r}
#Get NBA Totals Statistics
get_nba_totals_stats <- function(year) {
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_totals.html")
  webpage <- read_html(url)
  totals_stats_table <- webpage %>%
    html_node("table#totals_stats") %>%
    html_table(fill = TRUE)

  # Clean up column names
  colnames(totals_stats_table) <- make.names(colnames(totals_stats_table), unique = TRUE)

  # Clean the data
  totals_stats_table <- totals_stats_table %>%
    filter(!is.na(Player) & Player != "Player")  # Ensure no NA or duplicate header rows
  
  return(totals_stats_table)
}


get_nba_advanced_stats <- function(year) {
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
  
  # Fetch webpage
  webpage <- tryCatch({
    read_html(GET(url, user_agent("Mozilla/5.0")))
  }, error = function(e) {
    stop("Error fetching webpage: ", e$message)
  })
  
  # Extract table with updated ID
  advanced_stats_table <- webpage %>%
    html_node("table#advanced") %>%  # Updated selector to match the new ID
    html_table(fill = TRUE)
  
  # Clean up column names
  colnames(advanced_stats_table) <- make.names(colnames(advanced_stats_table), unique = TRUE)
  
  # Clean the data
  advanced_stats_table <- advanced_stats_table %>%
    filter(!is.na(Player) & Player != "Player")  # Remove NA rows and duplicate headers
  
  return(advanced_stats_table)
}


```


**ASSIGNMENT 4:** *Make a function with argument `year` that outputs one dataframe with the merged traditional and advanced data.* 


Official Cleaning Function that works as of 10/29/2024
```{r}

#Get NBA Totals Statistics
get_nba_totals_stats <- function(year) {
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_totals.html")
  webpage <- read_html(url)
  totals_stats_table <- webpage %>%
    html_node("table#totals_stats") %>%
    html_table(fill = TRUE)

  # Clean up column names
  colnames(totals_stats_table) <- make.names(colnames(totals_stats_table), unique = TRUE)

  # Clean the data
  totals_stats_table <- totals_stats_table %>%
    filter(!is.na(Player) & Player != "Player")  # Ensure no NA or duplicate header rows
  
  return(totals_stats_table)
}


get_nba_advanced_stats <- function(year) {
  url <- paste0("https://www.basketball-reference.com/leagues/NBA_", year, "_advanced.html")
  
  # Fetch webpage
  webpage <- tryCatch({
    read_html(GET(url, user_agent("Mozilla/5.0")))
  }, error = function(e) {
    stop("Error fetching webpage: ", e$message)
  })
  
  # Extract table with updated ID
  advanced_stats_table <- webpage %>%
    html_node("table#advanced") %>%  # Updated selector to match the new ID
    html_table(fill = TRUE)
  
  # Clean up column names
  colnames(advanced_stats_table) <- make.names(colnames(advanced_stats_table), unique = TRUE)
  
  # Clean the data
  advanced_stats_table <- advanced_stats_table %>%
    filter(!is.na(Player) & Player != "Player")  # Remove NA rows and duplicate headers
  
  return(advanced_stats_table)
}

get_cleaned_nba_stats <- function(year) {
  # Fetch totals and advanced stats
  nba_totals <- get_nba_totals_stats(year)
  nba_advanced <- get_nba_advanced_stats(year)

  # Print to check datasets (Optional)
  print("NBA Totals:")
  print(head(nba_totals))
  print("NBA Advanced:")
  print(head(nba_advanced))

  # Clean Player names in both datasets: remove the asterisk and trim spaces,
  # so the same player matches across the two tables in the merge below
  nba_totals <- nba_totals %>%
    mutate(Player = trimws(gsub("\\*", "", Player)))
  nba_advanced <- nba_advanced %>%
    mutate(Player = trimws(gsub("\\*", "", Player)))

  # Merge the datasets on Player, Pos, G, and MP
  nba_merge <- merge(nba_totals, nba_advanced, 
                     by = c("Player", "Pos", "G", "MP"), 
                     all.x = TRUE)

  # Debug: Show merged data sample
  print("Merged Data Sample:")
  print(head(nba_merge))
  
  # Check column names to confirm 'Team' exists
  print("Column Names Before Renaming:")
  print(colnames(nba_merge))

  # Clean 'Team' column: rename 'Tm' to 'Team' if present
  if ("Tm" %in% colnames(nba_merge)) {
    nba_merge <- nba_merge %>%
      rename(Team = Tm)
  }

  # Debug: Show column names after renaming
  print("Column Names After Renaming 'Tm' to 'Team':")
  print(colnames(nba_merge))

  # Handle duplicate columns like Team.x, Team.y, Awards.x, Awards.y
  duplicate_columns <- colnames(nba_merge)[grepl("\\.x$", colnames(nba_merge))]

  for (col in duplicate_columns) {
    # Extract the base name of the column (e.g., "Team" from "Team.x")
    base_col <- sub("\\.x$", "", col)
    
    # Merge the .x and .y columns into one
    if (paste0(base_col, ".y") %in% colnames(nba_merge)) {
      nba_merge <- nba_merge %>%
        mutate(!!base_col := coalesce(get(col), get(paste0(base_col, ".y")))) %>%
        select(-all_of(c(col, paste0(base_col, ".y"))))  # Drop the old columns
    }
  }

  # Remove combined rows for players who appeared for multiple teams
  # ("TOT", or "2TM"/"3TM" in any letter case, e.g. "2Tm")
  nba_merge <- nba_merge %>%
    filter(!grepl("^(TOT|2TM|3TM)$", Team, ignore.case = TRUE))

  # Remove players with multiple positions
  nba_merge <- nba_merge %>%
    filter(!grepl("-", Pos))

  # Remove the 'X', 'X.1' columns if they exist
  columns_to_remove <- c("X", "X.1")
  nba_merge <- nba_merge %>%
    select(-any_of(columns_to_remove))  # Remove specified columns if they exist

  # Merge 'Rk.x' and 'Rk.y' columns
  if ("Rk.x" %in% names(nba_merge) && "Rk.y" %in% names(nba_merge)) {
    nba_merge <- nba_merge %>%
      mutate(Rk = coalesce(as.character(Rk.x), as.character(Rk.y))) %>%
      select(-Rk.x, -Rk.y)
  }

  # Merge 'Age.x' and 'Age.y' columns
  if ("Age.x" %in% names(nba_merge) && "Age.y" %in% names(nba_merge)) {
    nba_merge <- nba_merge %>%
      mutate(Age = coalesce(as.numeric(Age.x), as.numeric(Age.y))) %>%
      select(-c(Age.x, Age.y))
  }

  # Reorder columns for clarity
  column_order <- c("Player", "Pos", "Age", "Rk", "G", "MP", "Team")
  nba_merge <- nba_merge %>%
    select(all_of(column_order), everything())

  # Return the cleaned and merged dataset
  return(nba_merge)
}

# Example usage
nba_data_2013 <- get_cleaned_nba_stats(2013)

# View the first few rows of the cleaned dataset
head(nba_data_2013)


```
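
The merge-and-coalesce logic inside `get_cleaned_nba_stats()` can be tried on small data frames without touching the network. This is a minimal sketch under stated assumptions: the `totals` and `advanced` tables below are made-up stand-ins with invented values, and the loop uses base-R column assignment rather than the `mutate()` call in the function above.

```{r}
library(dplyr)

# Made-up stand-ins for the scraped totals and advanced tables
totals <- data.frame(
  Player = c("Player One", "Player Two"),
  Pos = c("PG", "C"), G = c(82, 75), MP = c(2800, 2100),
  Age = c(27, 31), PTS = c(1800, 900),
  stringsAsFactors = FALSE
)
advanced <- data.frame(
  Player = c("Player One*", "Player Two"),  # asterisk as seen on some names
  Pos = c("PG", "C"), G = c(82, 75), MP = c(2800, 2100),
  Age = c(27, 31), PER = c(24.1, 15.3),
  stringsAsFactors = FALSE
)

# Strip the asterisk so the names match across tables
advanced <- advanced %>%
  mutate(Player = trimws(gsub("\\*", "", Player)))

# Columns present in both tables but not used as keys get .x/.y suffixes
merged <- merge(totals, advanced,
                by = c("Player", "Pos", "G", "MP"), all.x = TRUE)

# Collapse each .x/.y pair into a single column
for (col in grep("\\.x$", names(merged), value = TRUE)) {
  base_col <- sub("\\.x$", "", col)
  y_col <- paste0(base_col, ".y")
  if (y_col %in% names(merged)) {
    merged[[base_col]] <- coalesce(merged[[col]], merged[[y_col]])
    merged[[col]] <- NULL
    merged[[y_col]] <- NULL
  }
}
print(merged)
```

Without the asterisk cleanup, "Player One" would fail to match and its `PER` would come back as `NA`, since `all.x = TRUE` keeps unmatched rows from the totals table.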


```{r}
nba_data_2022 <- get_cleaned_nba_stats(2022)
nba_data_2010 <- get_cleaned_nba_stats(2010)
nba_data_2015 <- get_cleaned_nba_stats(2015)
nba_data_2011 <- get_cleaned_nba_stats(2011)
nba_data_2012 <- get_cleaned_nba_stats(2012)
nba_data_2009 <- get_cleaned_nba_stats(2009)
nba_data_2008 <- get_cleaned_nba_stats(2008)
nba_data_2007 <- get_cleaned_nba_stats(2007)
nba_data_2006 <- get_cleaned_nba_stats(2006)
nba_data_2005 <- get_cleaned_nba_stats(2005)
nba_data_2004 <- get_cleaned_nba_stats(2004)
nba_data_2003 <- get_cleaned_nba_stats(2003)
nba_data_2002 <- get_cleaned_nba_stats(2002)
nba_data_2001 <- get_cleaned_nba_stats(2001)
nba_data_2000 <- get_cleaned_nba_stats(2000)

nba_data_1999 <- get_cleaned_nba_stats(1999)
nba_data_1998 <- get_cleaned_nba_stats(1998)
nba_data_1997 <- get_cleaned_nba_stats(1997)
nba_data_1996 <- get_cleaned_nba_stats(1996)
nba_data_1995 <- get_cleaned_nba_stats(1995)
nba_data_1994 <- get_cleaned_nba_stats(1994)
nba_data_1993 <- get_cleaned_nba_stats(1993)
nba_data_1992 <- get_cleaned_nba_stats(1992)
nba_data_1991 <- get_cleaned_nba_stats(1991)
nba_data_1990 <- get_cleaned_nba_stats(1990)
nba_data_1989 <- get_cleaned_nba_stats(1989)
nba_data_1988 <- get_cleaned_nba_stats(1988)
nba_data_1987 <- get_cleaned_nba_stats(1987)
nba_data_1986 <- get_cleaned_nba_stats(1986)
nba_data_1985 <- get_cleaned_nba_stats(1985)
nba_data_1984 <- get_cleaned_nba_stats(1984)
nba_data_1983 <- get_cleaned_nba_stats(1983)
nba_data_1982 <- get_cleaned_nba_stats(1982)
nba_data_1981 <- get_cleaned_nba_stats(1981)
nba_data_1980 <- get_cleaned_nba_stats(1980)

# Read the first line of a team page (a quick check that the site responds)
readLines("https://www.basketball-reference.com/teams/LAL/2023.html", n = 1)

```
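
The season-by-season calls above could also be written as a loop that collects the results in a named list. The sketch below is one possible approach, not the project's method: `fetch_seasons()` and `stub_fetch()` are hypothetical helpers introduced here, and the 3-second default pause between requests is an assumed value. The demo runs against a stub so it needs no network access.

```{r}
# Hypothetical helper: call fetch_fn once per year and keep results in a
# list named by year. Failed years are stored as NULL instead of stopping.
fetch_seasons <- function(years, fetch_fn, delay = 3) {
  results <- vector("list", length(years))
  names(results) <- as.character(years)
  for (i in seq_along(years)) {
    value <- tryCatch(
      fetch_fn(years[i]),
      error = function(e) {
        message("Season ", years[i], " failed: ", conditionMessage(e))
        NULL
      }
    )
    # results[i] <- list(value) keeps the slot even when value is NULL
    results[i] <- list(value)
    # Pause between requests (assumed default), but not after the last one
    if (i < length(years)) Sys.sleep(delay)
  }
  results
}

# Stub standing in for get_cleaned_nba_stats(); fails for one year on purpose
stub_fetch <- function(year) {
  if (year == 1981) stop("simulated failure")
  data.frame(Player = "Player One", Season = year)
}

seasons <- fetch_seasons(1980:1982, stub_fetch, delay = 0)
print(names(seasons))
```

With the real scraper, a call such as `fetch_seasons(1980:2022, get_cleaned_nba_stats)` would return one data frame per season, accessible as `seasons[["2013"]]`.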


**ASSIGNMENT 5:** *Make this file more visually appealing, with headers, bullet points, sections, and subsections as you see fit. You may consider migrating to Quarto for this purpose.*


Save the dataframes as CSV files and print the working directory to locate them
```{r}
# Save your dataframe as a CSV file
write.csv(nba_roster2, file = "generalstats.csv", row.names = FALSE)
write.csv(AO_nba_advanced_stats2, file = "advancedstats.csv", row.names = FALSE)
write.csv(nba_data_2023, file = "nba2023.csv", row.names = FALSE)
write.csv(nba_data_2013, file = "nba2013.csv", row.names = FALSE)
getwd()  # Print the working directory, where the CSV files were written

```
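
To confirm where a file lands and that it reads back intact, the save step can be tested with an explicit path. This is a small sketch with assumptions: the `toy` data frame and the file name `nba_toy.csv` are invented for illustration, and `tempdir()` is used in place of the working directory so the example leaves the project folder untouched.

```{r}
# Made-up data standing in for one season's cleaned dataset
toy <- data.frame(
  Player = c("Player One", "Player Two"),
  PTS = c(1800, 900),
  stringsAsFactors = FALSE
)

# Build the full output path explicitly instead of relying on getwd()
out_path <- file.path(tempdir(), "nba_toy.csv")
write.csv(toy, file = out_path, row.names = FALSE)

# Read the file back to check that the rows and columns survived
round_trip <- read.csv(out_path, stringsAsFactors = FALSE)
print(out_path)
print(round_trip)
```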





